scCAN: single-cell clustering using autoencoder and network fusion

Unsupervised clustering of single-cell RNA sequencing data (scRNA-seq) is important because it allows us to identify putative cell types. However, the large number of cells (up to millions), the high-dimensionality of the data (tens of thousands of genes), and the high dropout rates all present substantial challenges in single-cell analysis. Here we introduce a new method, named single-cell Clustering using Autoencoder and Network fusion (scCAN), that can overcome these challenges to accurately segregate different cell types in large and sparse scRNA-seq data. In an extensive analysis using 28 real scRNA-seq datasets (more than three million cells) and 243 simulated datasets, we validate that scCAN: (1) correctly estimates the number of true cell types, (2) accurately segregates cells of different types, (3) is robust against dropouts, and (4) is fast and memory efficient. We also compare scCAN with CIDR, SEURAT3, Monocle3, SHARP, and SCANPY. scCAN outperforms these state-of-the-art methods in terms of both accuracy and scalability. The scCAN package is available at https://cran.r-project.org/package=scCAN. Data and R scripts are available at http://sccan.tinnguyen-lab.com/


Evaluation metrics
We use three different metrics for comparing the obtained partitions with the known cell types: adjusted Rand index (ARI) 29 , adjusted mutual information (AMI) 30 , and V-measure 31 . To evaluate the capability of each method in predicting the true number of clusters, we use absolute log-modulus 32 .

Adjusted Rand index
Rand index (RI) evaluates the similarity between predicted clusters and true cell types. Given P as a set of clusters and Q is a set of true cell types then RI is calculated as: where t is the number of pairs belonging to the same cell type in Q and are grouped together in the same cluster in P, u is the number of pairs of different cell types in Q and are grouped to different clusters in P, v is the number of pairs of the same cell types in Q and are grouped to different clusters in P, s is the number of pairs in different cell types in Q and are grouped together in the same cluster in P, N is the total number of cells, and N 2 is the number of possible pairs. In brief, RI measures the ratio of pairs that are clustered in the same way (either together or different) from two partitions (e.g. 0.80 means 80% of pairs are grouped in the same way). The Adjusted Rand Index (ARI) 29 is the corrected-for-chance version of the Rand Index. The ARI values ranged from -1 to 1 in which 0 indicates a random grouping. The ARI score is calculated as :

Adjusted mutual information
Adjusted mutual information (AMI) is an adjustment of the mutual information (MI) score to account for random partitioning. It accounts for the fact that the MI is generally higher for two clusters with a larger number of clusters, regardless of whether there is actually more information shared. The calculation of AMI is presented as follows: Given a dataset of n cells with true partition X = {X 1 , X 2 , ..., X R } of R clusters and predicted partition Y = {Y 1 ,Y 2 , ...,Y C } of C clusters. The mutual information of cluster overlap between X and Y can be summarized as a contingency table M R×C = [n i j ], where i = 1...R, j = 1...C, and n i j represents the number of common data point falls into cluster X i is p(i) = |x i | n . The entropy associated with the clustering X is calculated as follows: H(X) gives outputs as non-negative values where 0 indicates that there is one cluster in the dataset. The mutual information (MI) between two clusters X and Y is calculated as follows: where P(i, j) is the cell that is classified to both clusters X i in X and Y j in Y . P(i, j) is calculated as follows:

6/33
MI gives outputs as non-negative values bounded by the entropies H(X) and H(Y ) and 0 indicates that there is no cell classified to the same cluster. To correct for the fact that two random clusterings do not give a constant value, and tends to be larger when the two partitions have a larger number of clusters. Therefore, AMI is defined as follows: where E{MI(X,Y )} is the expected mutual information between two random clusterings. The AMI takes value between 0 and 1 where 0 stands for random clustering and 1 represents a perfect partition.

V-Measure
V-Measure is the harmonic mean between two measures: homogeneity and completeness. Homogeneous clustering is when each cluster has data points belonging to the same class. Complete clustering is when all dat a points belonging to the same class are clustered into the same cluster. Given a set of classes C = {C 1 ,C 2 , ...,C l }, a set of cluster K = {K 1 , K 2 , ..., K m } and the conditional entropy of the class distribution given the identified clustering is computed as H(C|K). The homogeneity is defined as follows: The completeness is symmetrical to homogeneity. To measure the completeness, the distribution of cluster assignments within each class is assessed. In a perfect clustering, each of these distributions will be completely skewed to a single cluster. Given the homogeneity h and completeness c, the V-measure is computed as the weighted harmonic mean β between homogeneity and completeness: if β is greater than 1, completeness is weighted more strongly in the calculation. If β is less than 1, homogeneity is weighted more strongly. Since the computations of homogeneity, completeness and V-measure are completely independent of the number of classes, the number of clusters, the size of the dataset and the clustering algorithm, these measures can be employed for evaluating any clustering solution.

Absolute symmetric log-modulus
To evaluate the accuracy of methods in estimating the correct number of clusters, we used absolute symmetric log-modulus 32 transformation defined as follows: where x is the difference between the estimated number of clusters and the true number of cell types in a given dataset. The higher values of absolute log-modulus transformation mean the number of estimated clusters is more different from the number of true cell types. x equals to zero denotes the perfect estimation.   We plotted both the original and new cluster annotations in the original t-SNE/UMAP plots. We also quantified the correlation between the two annotations (true cell types and clustering) using Rand index (RI). RI measures the agreement between a cluster annotation and the true cell types. In short, RI = (a + b)/ N 2 where a is the number of pairs that belong to the same true cell type and are clustered together, b is the number of pairs that belong to different true cell types and are not clustered together, and N 2 is the number of possible pairs that can be formed from the N cells. Intuitively, RI is the fraction of pairs that are grouped in the same way (either together or not) in the two annotations compared (e.g. 0.9 means 90% of pairs are grouped in the same way). Figures S1-S5 show the annotations on the original t-SNE plots. For each dataset, we plotted the two annotations side-by-side: the left panel shows the original annotation whereas the right panel shows the new cluster annotation. Figures S6-S10 shows the annotations on the original UMAP plots. For each dataset, we quantified the similarity between the two annotations using Rand index (RI). An RI of 1 indicates the ideal case in which the two annotations are identical. The average RI across all datasets is 0.93. This indicates a strong correlation between the two annotations on the original t-SNE/UMAP plots.

Comparison of the clustering methods used in Modules 2 and 3
The first method (core method) is more accurate but it requires more computational power and memory. Therefore, we developed the second method that allows users to analyze large datasets faster and using less memory. If the input dataset is small (by default 5,000 cells or less), both methods will be the same and thus produce the same results. When the dataset is large (5,000 cells or more), we use the first method to analyze a subset of the data to determine the cell types and then assign the the remaining cells to the determined cell types (second method).
Note that the default value of 5,000 allows us to have a sufficiently large sample size to properly determine the cell types which in turns will lead to a proper classification of the remaining cells. At the same time, 5,000 is a reasonable small number of samples that allows users to perform the analysis efficiently using personal computers. Users can also change this parameter to use the first method even for large datasets, if they have more memory and are willing to wait longer for their results. In the following text, as requested, we will provide a direct comparison between the two methods in terms of both accuracy and running time. Table S8 shows a direct comparison of the two methods in terms of both accuracy and running time using the same server (with 200 GB of RAM). Consistent with the previous submission, we used adjusted Rand index (ARI), adjusted mutual information (AMI), and V-measure to assess the performance of each method. Cells with NA values indicate that a method was not able to analyze the dataset (out-of-memory). Cells highlighted in bold have the higher accuracy (ARI, AMI, and V-measure) and lower running time.
Overall, the first method can only analyze the first 21 datasets. It returns NA for the last seven datasets with 44,808 cells or more (out of memory). The second method can analyze all datasets, even for the Cao dataset with more than a million cells.
Regarding running time, the second method is substantially faster than the first method. For example, the second method was able to analyze the Zilionis dataset in 18 minutes while it takes the first method method almost 3 days. For the Cao dataset with a million cells, the second method finished the analysis in less than 40 minutes whereas the first method ran out of memory and could not analyze the data.
Regarding the accuracy, the first method is slightly more accurate (when they can analyze the data) but the difference between the two methods is marginal. For example, the first method has a higher ARI in three dataset (Guo, Chen, and Slyper) but has lower ARI in three other datasets (Montoro, Kanton, and Zilionis). Similarly, the two methods have comparable AMI and V-measure values.
In summary, the first method is slightly more accurate but the second method is capable of analyzing large datasets and requires less memory and running time. Therefore, the scCAN software automatically switches to the second method when analyzing datasets with 5,000 cells or more. Users can adjust this parameter if they wish to run the first method for larger datasets, given that they have sufficient memory and are willing to wait longer for the results. Table S8. Performance of the two clustering methods used in Module 2 (method 1) and Module 3 (method 2) on single-cell datasets measured by adjusted Rand index (ARI), adjusted mutual information (AMI), V-measure and running time (minutes). Cells with NA values indicate that the method was not able to analyze the dataset (out-of-memory). Cells highlighted in bold have the higher accuracy (ARI, AMI, and V-measure) or lower running time.

Effects of min-max scaling
The min-max scaling is not a scRNA-seq normalization method and it is not intended to do so. We leave the step of data processinga and normalization completely up to the users. This min-max scaling added to our method is used on top of the already normalized data provided by users. Such scaling is frequently used in deep learning models [33][34][35][36] with the common purpose of reducing standard deviation and suppressing the effect of outliers. Below, we will demonstrate that the min-max scaling step improves the clustering performance without altering the transcriptome landscape.
To demonstrate the usefulness of this min-max scaling on clustering, we re-analyzed all single-cell datasets using scCAN without applying the min-max scaling step. Figure S11 shows the ARI values obtained from scCAN in two scenarios: scCAN with and without the scaling step. Overall, the min-max scaling makes the analysis more robust (lower variance) and more accurate (higher ARI). This result demonstrates the usefulness of the min-max scaling in improving the performance of scCAN. To further demonstrate that the min-max scaling does not alter the transcriptome landscape of the data, we calculated the distance correlation index (dCor) 37 between the two dimensional representation of scaling and non-scaling data generated by t-SNE. Given X and Y as the 2D representation of the scaling and non-scaling data, dCor is calculated as dCor = where dCov(X,Y ) is the distance covariance between X and Y while dVar(X) and dVar(Y ) are distance variances of X and Y . Specifically, dCor first calculates the pair-wise distances for X by computing the distance between each pair of cells, resulting in a square matrix. Second, it calculates the pair-wise distances for Y . Finally, it compares the two matrices using the formula described above to obtain the distance correlation. The dCor coefficient has values ranging from 0 to 1, with the dCor is expected to be 1 for a perfect similarity. In our analysis, when we rotate the transcriptome landscape, dCor does not change. We re-analyzed the single-cell datasets and calculate the distance correlation for each dataset. Overall, the dCor values obtained from all datasets are very high (median dCor of 0.99 and variance of 0.01). This confirms that the min-max scaling does not alter the transcriptome landscape of the data while improving the clustering results.

26/33
The sampling process is necessary to reduce both time and space complexity, but it can alter the capability of detecting rare cell types. By selecting 5,000 cells from a large dataset, we might end up with insufficient number of rare cells, and therefore reduce the chance of detecting them.
In addition, we have developed two strategies to enhance the method's capability of detecting rare cell types. First, we now allow users to change the parameter samp.size so that they can increase the sample size, thus boosting the method's capability in detecting rare cell types. Second, we provide an instruction to perform multi-state clustering, i.e., further splitting the clustering results. When a cell type has too few cells, these cells are often mistakenly grouped with other cell types. By further splitting each clusters, we are able to detect rare cell types that would not be possible by performing one-stage clustering.
To demonstrate the efficiency of both solutions, we have tested them on the Zilionis dataset. The Zilionis dataset has 34,558 cells and 9 cell types. The transcriptome landscape and the cell types of the dataset are shown in Figure S12A. Among the 9 cell types, the tRBC cell type has only 108 cells (0.3%). A sub-sample of 5,000 cells is expected to have approximately 19 tRBC cells, which might be insufficient for many clustering method to detect them. Indeed, as show in Figure S12B, scCAN mistakenly grouped tRBC cells with tPlasma cells when we used the default setting of samp.size = 5, 000.
To demonstrate the efficiency of the first strategy, we set samp.size = 10, 000. The clustering results using the new parameter is shown in Figure S12C. With a sample size of 10,000, the method can properly separate tRBC cells and assigned them to cluster 2. To quantify how well the method separates tRBC cells from other cells, we calculated the F1 score 38 . Briefly, F1 = 2 * precison * recall precison + recall = T P T P + 1 2 (FP + FN) where: i) TP are tRBC cells that were correctly assigned to cluster 2, ii) FP are cells of other cell types that were mistakenly assigned to cluster 2, iii) and FN are tRBC cells but were not assigned to cluster 2. In the ideal case, FP=FN=0 which leads to F1=1. In the analysis shown in Figure S12C, F1 score is 0.9 which indicates that scCAN properly separated tRBC from the rest. The method is expected to perform even better if we further increase the sample size.
To demonstrate the efficiency of the second strategy, we performed a two-stage clustering using the the default setting of samp.size = 5, 000. In stage one, we partitioned the data using scCAN and obtained the clustering results as shown in Figure S12B. In stage two, we further partitioned each cluster obtained from stage one using the same method scCAN. The results of stage two are shown in Figure S12D. Cluster 2 were further split into two sub-clusters: 2 1 and 2 2. The tRBC cells were completely separated from the rest (cluster 2 2) with an F1 score of 1. This demonstrates that users can efficiently detect rare cell types using multi-stage clustering even with the default parameter samp.size = 5, 000. Figure S12. Rare cell type detection using the Zilionis dataset as example. The dataset has a total of 34,558 cells, in which there are 108 tRBC cells (rare cell type with 0.3% prevalence). (A) Transciptome landscape and true cell types. (B) Clustering results using scCAN with default sample size (samp.size = 5, 000), in which tRBC are mistakenly grouped with tPlasma cells. (C) Clustering results with sample size of 10,000 (samp.size = 10, 000). In this case, scCAN properly separates tRBC cells in cluster 2 with an F1 score of 0.9. (D) Clustering results using two-stage strategy and default sample size (samp.size = 5, 000). scCAN properly separates tRBC cells in cluster 2 with a perfect F1 score of 1.
To demonstrate the scalability of scCAN, we downloaded and analyzed the Brain 1.3M dataset (https:// genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1382-0). Only scCAN and SCANPY were able to analyze this dataset of 1.3 million of cells. The clustering results of the two methods are shown in Figure S13. scCAN partitioned the data into 19 cluster whereas SCANPY partitioned the data into 20 clusters. The running time of scCAN and SCANPY were 51 minutes and 70 minutes, respectively. Note that we could not assess the accuracy of the two methods using this particular dataset because it does not have true cell type information. Second, we downloaded the Cao dataset 27 that contains 1,092,000 cells sequenced from the human cerebellum with known cell types. Again, only scCAN and SCANPY were able to analyze this dataset. Figure S14A shows the visualization of 2D t-SNE embedding data generated from raw data with original cells annotations while Figure S14B-C show the visualizations of Cao dataset using clusters generated from SCANPY and scCAN. SCANPY can cluster the whole dataset in 51 minutes with the ARI of 0.48 ( Figure S14B), while scCAN takes 39 minutes to partition cells with the ARI of 0.89 ( Figure S14C). We have updated the analysis results for the Brain 1.